
Performance Analysis¶

We'll start by analyzing our best performing model after initial tuning. After roughly 100-200 runs, we selected:

  • the best performing scheduler {explain which}
  • the optimal set of transformations/augmentations {explain which}

We've arrived at this configuration by maximizing only high-level model performance metrics:

  • total weighted loss (combining the normalized gender and age prediction losses)
  • gender prediction accuracy
  • MAE for age predictions
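The exact weighting used for model selection isn't spelled out in the notebook; as a rough illustration, the total weighted loss could be a convex mix of the two normalized task losses. The weight `w_gender` and the scale factors below are hypothetical values, not the ones used in tuning:

```python
def combined_loss(gender_bce, age_mae, w_gender=0.5,
                  gender_scale=1.0, age_scale=10.0):
    """Combine the two task losses into one scalar for model selection.

    Each loss is first divided by a rough scale factor so the two terms are
    comparable, then mixed with weight `w_gender`.  The scales and weight
    here are illustrative defaults, not the values used in tuning.
    """
    g = gender_bce / gender_scale   # normalized gender term
    a = age_mae / age_scale         # normalized age term
    return w_gender * g + (1.0 - w_gender) * a

# Example with values close to the reported test metrics (BCE ~0.18, MAE ~5.1)
score = combined_loss(0.18, 5.1)
```

Any monotone combination would work for ranking runs; normalizing first just keeps one task from dominating the selection.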

Gender Prediction Performance¶

Out[18]:
               Female         Male      Overall
Support   2353.000000  2387.000000  4740.000000
Accuracy     0.931013     0.931013     0.931013
Precision    0.924204     0.937925     0.931065
Recall       0.937952     0.924173     0.931062
F1-score     0.931027     0.930998     0.931013
AUC-ROC           NaN          NaN     0.980522
PR-AUC            NaN          NaN     0.977997
Log Loss          NaN          NaN     0.178862
Brier Score       NaN          NaN          NaN
{TODO: add confusion matrix}

Age Predictions¶

Out[20]:
               Value
MAE         5.105901
MSE        54.144762
RMSE        7.358312
R-squared   0.862191
MAPE       25.161557
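For reference, the regression metrics in the table above can be computed from raw predictions with NumPy alone; this is a generic sketch, not the notebook's own `error_analysis` utilities:

```python
import numpy as np

def age_metrics(y_true, y_pred):
    """Standard regression metrics for the age head (NumPy only)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = np.abs(err).mean()
    mse = (err ** 2).mean()
    rmse = np.sqrt(mse)
    ss_res = (err ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot
    # MAPE in percent; the epsilon guards against division by age 0
    mape = (np.abs(err) / np.maximum(y_true, 1e-8)).mean() * 100
    return {"MAE": mae, "MSE": mse, "RMSE": rmse,
            "R-squared": r2, "MAPE": mape}
```

Note that MAPE is inflated for the youngest subjects by construction, since the denominator is the true age; this matters when reading the per-group tables below.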
Out[21]:
          True Age  Predicted Age
Mean     33.308439      32.147823
Median   29.000000      28.514690
Min       1.000000      -2.139822
Max     116.000000      95.214233
Out[24]:
Age_Group  Total  Correct  Accuracy
0-4          444      307    0.6914
4-14         261      215    0.8238
14-24        636      604    0.9497
24-30       1228     1187    0.9666
30-40        865      837    0.9676
40-50        399      393    0.9850
50-60        420      409    0.9738
60-70        229      218    0.9520
70-80        156      149    0.9551
80+          102       94    0.9216
Out[23]:
Age_Group  Total  Correct  Accuracy
0-4          444      306    0.6892
4-14         261      221    0.8467
14-24        636      615    0.9670
24-30       1228     1188    0.9674
30-40        865      846    0.9780
40-50        399      394    0.9875
50-60        420      411    0.9786
60-70        229      223    0.9738
70-80        156      149    0.9551
80+          102       96    0.9412

We can see that gender prediction accuracy is very high across all age ranges except young children. Realistically, there is little we can do about that: facial features of babies differ substantially from those of adults. It might be worth investigating a separate model for them, but it's unlikely it would achieve very high performance either.

Out[25]:
Age_Group Support Age_MAE Age_MSE Age_RMSE Age_R-squared Age_MAPE
0 0-4 444 1.588580 11.325658 3.365361 -9.241579 99.745904
1 4-14 261 4.011655 34.033093 5.833789 -3.743251 46.700869
2 14-24 636 4.171022 32.965802 5.741585 -2.937213 21.156784
3 24-30 1228 3.720786 30.006521 5.477821 -10.167695 13.674633
4 30-40 865 6.270144 63.924114 7.995256 -7.162335 17.644973
5 40-50 399 7.749943 96.742555 9.835779 -10.194667 16.942367
6 50-60 420 7.311122 91.486462 9.564856 -11.248783 13.271226
7 60-70 229 6.725516 80.393407 8.966237 -8.236708 10.369088
8 70-80 156 7.617475 105.892985 10.290432 -11.530508 10.082188
9 80+ 102 8.947648 173.258202 13.162758 -3.118748 9.777900
Out[26]:
Age_Group Support Age_MAE Age_MSE Age_RMSE Age_R-squared Age_MAPE
0 0-4 444 1.014360 10.863007 3.295908 -8.823212 57.902984
1 4-14 261 3.195415 26.828285 5.179603 -2.739105 39.937090
2 14-24 636 3.587664 25.381818 5.038037 -2.031433 17.745698
3 24-30 1228 4.186014 39.485719 6.283766 -13.695620 15.399167
4 30-40 865 6.002176 59.482523 7.712491 -6.595198 16.917257
5 40-50 399 6.352205 64.919602 8.057270 -6.512241 13.902723
6 50-60 420 6.273703 72.882176 8.537106 -8.757924 11.377895
7 60-70 229 6.505069 69.960602 8.364245 -7.038043 10.017467
8 70-80 156 6.595112 74.531350 8.633154 -7.819429 8.710363
9 80+ 102 8.218197 167.143217 12.928388 -2.973381 8.948656
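The per-bin breakdowns above can be reproduced by bucketing absolute errors with `np.digitize`. The edges below mirror the tables' groups, though the exact boundary handling (e.g. whether age 4 lands in "0-4" or "4-14") is an assumption:

```python
import numpy as np

def mae_by_age_group(y_true, y_pred,
                     edges=(4, 14, 24, 30, 40, 50, 60, 70, 80)):
    """Mean absolute age error per age bin, using the bins from the tables.

    With np.digitize's default (right=False), an age exactly on an edge
    falls into the higher bin -- an illustrative convention, not
    necessarily the one used in the notebook's own grouping code.
    """
    y_true = np.asarray(y_true, dtype=float)
    abs_err = np.abs(np.asarray(y_pred, dtype=float) - y_true)
    bins = np.digitize(y_true, edges)  # 0 -> "0-4", ..., 9 -> "80+"
    labels = ["0-4", "4-14", "14-24", "24-30", "30-40",
              "40-50", "50-60", "60-70", "70-80", "80+"]
    return {labels[b]: abs_err[bins == b].mean() for b in np.unique(bins)}
```

The same bucketing extends directly to MSE, RMSE, or MAPE by swapping the aggregated quantity.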

LIME¶

Solving Age Balancing¶

[Figure: 840x2240 px]
Out[132]:
['dataset/test_2_folds_last/111_1_0_20170120134646399.jpg.chip.jpg',
 'dataset/test_2_folds_last/1_1_0_20170109194452834.jpg.chip.jpg',
 'dataset/test_2_folds_last/9_0_0_20170110225030430.jpg.chip.jpg',
 'dataset/test_2_folds_last/8_0_1_20170114025855492.jpg.chip.jpg',
 'dataset/test_2_folds_last/41_1_1_20170117021604893.jpg.chip.jpg']

Most Misclassified Images (both gender/age)¶

[Figure: 840x1400 px]
[Figure: 840x1400 px]

Misclassified Gender¶

Looking at gender specifically, it's likely that our model actually performs better than the summarized results imply.

The images above showcase where our model was least accurate, and we can see that all except one are likely cases of mislabeled data in the original dataset (or the labels are accurate with respect to those individuals' self-identity).

[Figure: 840x1960 px]

We can see three main issues:

  1. Some images are poor quality or strongly cropped. We may be able to solve this by using heuristics in preprocessing to exclude such samples from both the training and test sets.

  2. There are patterns related to race and age. The model has trouble classifying faces of non-white people, possibly due to different facial features or skin color (although the grayscale transform should partially mitigate the latter). It also struggles with very old people and with children/babies, possibly because of small sample sizes and the relatively more "androgynous" facial features in those groups. We'll attempt to fix this using augmentation combined with oversampling: we'll use transforms to create additional samples for underrepresented age bins, and we'll also use the color analysis from the EDA to oversample images of underrepresented skin colors.

  3. Many samples are potentially mislabeled. Some may show people who self-identify as male/female while retaining facial features, hairstyles, etc. typically associated with the opposite gender; others may simply be mislabeled. Either way, this will be the hardest issue to solve.

Filtering Out "Invalid" Samples¶

We'll use a mix of metrics to try to determine which images are of very poor quality, lack enough detail for proper classification, etc.:

BRISQUE (Blind/Referenceless Image Spatial Quality Evaluator):

A no-reference image quality assessment method. Uses scene statistics of locally normalized luminance coefficients to quantify possible losses of "naturalness" in the image due to distortions. Operates in the spatial domain.

Laplacian Variance:

A measure of image sharpness/blurriness. Uses the Laplacian operator to compute the second derivative of the image. Measures the variance of the Laplacian-filtered image.

FFT-based Blur Detection:

Uses Fast Fourier Transform to analyze the frequency components of an image. Applies a high-pass filter in the frequency domain and measures the remaining energy.

See the Data Analysis notebook for more details.
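As a sketch of the two blur measures (BRISQUE needs a pretrained quality model, so it's omitted here), both can be implemented with NumPy alone. The `cutoff` radius is an illustrative choice, not a tuned threshold:

```python
import numpy as np

def laplacian_variance(gray):
    """Sharpness score: variance of the Laplacian-filtered image.

    Applies the standard 4-neighbour Laplacian kernel via shifted slices,
    so no OpenCV/SciPy dependency is needed.  Lower values suggest blur.
    """
    g = np.asarray(gray, dtype=float)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return lap.var()

def fft_blur_score(gray, cutoff=8):
    """Blur score: share of spectral energy outside a low-frequency box.

    Zeroes a (2*cutoff)x(2*cutoff) window around the DC component of the
    shifted FFT (a crude high-pass filter) and returns the fraction of
    energy that remains; sharp images keep more high-frequency energy.
    """
    g = np.asarray(gray, dtype=float)
    spec = np.fft.fftshift(np.fft.fft2(g))
    total = np.abs(spec).sum()
    cy, cx = g.shape[0] // 2, g.shape[1] // 2
    spec[cy - cutoff:cy + cutoff, cx - cutoff:cx + cutoff] = 0
    return np.abs(spec).sum() / total
```

In practice one would threshold these scores (picked from the score distributions in the Data Analysis notebook) to flag samples for exclusion.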

BRISQUE + Laplacian Variance¶

One obvious major shortcoming of this approach is that we're excluding a significant proportion of samples simply because our model performs very poorly on them.

While {TODO}

A production pipeline might be:

  1. Check whether the image is valid using heuristics (e.g. prompting the user to reposition the camera)

Augmentation Based Oversampling¶

We'll use augmentation/transforms combined with oversampling to increase the number of samples in underrepresented classes. This approach:

  • allows us to preserve original data characteristics while introducing variability

Potential issues:

  • Risk of overfitting to augmented versions of underrepresented samples
  • Possibility of introducing unintended biases if augmentation isn't carefully balanced
  • May not fully address underlying dataset biases
  • Requires careful monitoring to ensure improved performance across all age groups
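A minimal sketch of the oversampling step, assuming augmentations are applied at load time so repeated indices don't yield pixel-identical copies. Topping every bin up to the largest bin's size is an illustrative target, not necessarily the scheme used here:

```python
import numpy as np

def oversample_indices(ages, edges=(4, 14, 24, 30, 40, 50, 60, 70, 80),
                       seed=0):
    """Return dataset indices with underrepresented age bins repeated.

    Each repeated index is meant to pass through a *random* transform in
    the data loader, so duplicates become distinct augmented samples.
    Bins are topped up to the size of the largest bin.
    """
    ages = np.asarray(ages)
    bins = np.digitize(ages, edges)
    rng = np.random.default_rng(seed)
    target = max(np.sum(bins == b) for b in np.unique(bins))
    out = []
    for b in np.unique(bins):
        idx = np.flatnonzero(bins == b)
        out.extend(idx)                    # keep every original once
        extra = target - idx.size          # how many augmented copies to add
        out.extend(rng.choice(idx, size=extra, replace=True))
    return np.array(out)
```

The resulting index list can back a sampler (or simply an index-shuffled epoch), which addresses the overfitting risk only partially: monitoring per-bin metrics, as in the tables above, remains necessary.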

Comparing Both Models¶

Let's look at samples that were misclassified by the initial model but are now predicted correctly by the new model:

Out[112]:
image_path true_age age_pred_base age_error_base age_pred_improved age_error_improved error_reduction
4421 80_1_0_20170110131953974.jpg.chip.jpg 80 30.372637 49.627363 68.726158 11.273842 38.353521
3372 46_1_3_20170120140919993.jpg.chip.jpg 46 12.257863 33.742137 46.783295 0.783295 32.958842
3788 55_0_0_20170117204213768.jpg.chip.jpg 55 17.566719 37.433281 44.354115 10.645885 26.787395
2525 34_1_2_20170108224608753.jpg.chip.jpg 34 6.370270 27.629730 33.015022 0.984978 26.644752
2910 38_1_0_20170117154129371.jpg.chip.jpg 38 14.315865 23.684135 37.683617 0.316383 23.367752
[Figure: 840x1120 px]
[Figure: 840x1680 px]

Of course, we have specifically selected the best-case examples (i.e. where the model's performance improved the most), which probably gives a much too optimistic picture of the overall improvement (the overall gains in accuracy/MAE are not as significant).

Instead, we've selected some of the samples our initial model failed on that were unlikely to be mislabeled: